↓↓ Project access ↓↓
GitHub link to the project


# Introduction


As part of the applied statistics module, we covered the basics of statistics and learned to use tools for performing advanced statistical operations on a dataset with the R language. Our choice fell on the Prestige dataset, which describes Canadian occupations through variables such as education, income, the share of women, and a prestige score. These data ship with the R `car` package (the `Prestige` data frame used throughout the analysis below).

This applied statistics project consists in using the appropriate R tools to run linear regressions on our dataset and to produce a report.



# Tools and working environment

## R and RStudio

R is a programming language designed for processing and organizing datasets in order to apply statistical tests of varying complexity and to represent the data graphically, using the wide variety of plot types available. RStudio is an application that provides a development environment and tooling tailored to the R language.

RStudio's main purpose is to make development in R easier. To that end, the program offers many tools that let you write scripts, run code, build plots, and work with a variety of datasets.
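As a minimal, self-contained illustration of this workflow (using only a built-in example dataset, not the project data):

```r
# Minimal illustration of a typical R session: a built-in dataset,
# a simple statistical test, and a base-R plot.
data(mtcars)

# Welch two-sample t-test: fuel economy by transmission type.
t.test(mpg ~ am, data = mtcars)

# Quick scatter plot of fuel economy against weight.
plot(mtcars$wt, mtcars$mpg,
     xlab = "Weight (1000 lbs)", ylab = "Miles per gallon")
```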


## R Markdown

The R Markdown extension generates documents dynamically by combining formatted text with results produced by R code. The generated documents can be in HTML, PDF, Word, and many other formats, which makes it a very convenient tool for exporting, communicating, and sharing analysis results.
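A minimal R Markdown source file has the following shape (the title and chunk contents here are illustrative, not taken from the project):

````markdown
---
title: "Applied statistics report"
output: html_document
---

Some formatted **text**, followed by an R chunk whose result is
embedded in the rendered document:

```{r}
summary(cars)
```
````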


## Complementary tools

Beyond base R, the analysis relies on several complementary packages: readr and dplyr for importing and manipulating the data, ggplot2 and plotly for static and interactive graphics, hrbrthemes for plot themes, and car for the regression diagnostics used below.
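The complementary packages used by the code in this report can be loaded once at the start of the document (the list below is inferred from the functions the code calls):

```r
library(readr)      # read_csv(): data import
library(dplyr)      # %>% pipe, as_tibble() and data manipulation
library(ggplot2)    # plots (ggplot, geom_smooth, ...)
library(plotly)     # ggplotly(): interactive versions of the plots
library(hrbrthemes) # theme_ipsum() plot theme
library(car)        # Prestige data and regression diagnostics
                    # (durbinWatsonTest, ncvTest, outlierTest, ...)
```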





# Linear regression


```r
# Load the dataset from CSV; read_csv() already returns a tibble,
# so the as_tibble() call is kept only for explicitness.
data <- read_csv("data/prestige.csv") %>% as_tibble()
data
## # A tibble: 102 x 6
##    education income women prestige census type 
##        <dbl>  <dbl> <dbl>    <dbl>  <dbl> <chr>
##  1      13.1  12351 11.2      68.8   1113 prof 
##  2      12.3  25879  4.02     69.1   1130 prof 
##  3      12.8   9271 15.7      63.4   1171 prof 
##  4      11.4   8865  9.11     56.8   1175 prof 
##  5      14.6   8403 11.7      73.5   2111 prof 
##  6      15.6  11030  5.13     77.6   2113 prof 
##  7      15.1   8258 25.6      72.6   2133 prof 
##  8      15.4  14163  2.69     78.1   2141 prof 
##  9      14.5  11377  1.03     73.1   2143 prof 
## 10      14.6  11023  0.94     68.8   2153 prof 
## # … with 92 more rows
```
```r
# Scatter plot of prestige against education, with a loess smoother (red)
# and a linear fit (blue) overlaid.
education <- data %>% 
  ggplot(aes(x = education, y = prestige)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red", fill = "#69b3a2", se = TRUE) +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  theme_ipsum()

scatter_edu <- ggplotly(education)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
scatter_edu
```

```r
# Same plot for income.
income <- data %>% 
  ggplot(aes(x = income, y = prestige)) +
  geom_point() +
  geom_smooth(method = "loess", color = "red", fill = "#69b3a2", se = TRUE) +
  geom_smooth(method = "lm", color = "blue", se = FALSE) +
  theme_ipsum()

scatter_inc <- ggplotly(income)
## `geom_smooth()` using formula 'y ~ x'
## `geom_smooth()` using formula 'y ~ x'
scatter_inc
```
```r
# Simple linear regression of prestige on education, using the Prestige
# data frame from the car package (row names are occupations).
prest.lm1 <- lm(prestige ~ education, data = Prestige)
prest.lm1
## 
## Call:
## lm(formula = prestige ~ education, data = Prestige)
## 
## Coefficients:
## (Intercept)    education  
##     -10.732        5.361
```

```r
# Autocorrelation of the residuals, rebuilt manually as a ggplot2 plot,
# with a 95% confidence band around zero.
bacf <- acf(residuals(prest.lm1), plot = FALSE)
bacfdf <- with(bacf, data.frame(lag, acf))
conf.level <- 0.95
ciline <- qnorm((1 - conf.level)/2)/sqrt(length(residuals(prest.lm1)))

lag_plot <- bacfdf %>%
  ggplot(aes(x = lag, y = acf)) +
  geom_hline(aes(yintercept = 0)) +
  geom_segment(aes(xend = lag, yend = 0)) +
  geom_hline(aes(yintercept = ciline), linetype = 3, color = 'darkblue') +
  geom_hline(aes(yintercept = -ciline), linetype = 3, color = 'darkblue') +
  theme_ipsum()

lag_plot <- ggplotly(lag_plot, tooltip = c("lag", "acf"))
lag_plot
```

```r
# Durbin-Watson test for autocorrelation of the residuals (from car).
durbinWatsonTest(prest.lm1)
##  lag Autocorrelation D-W Statistic p-value
##    1       0.2752512       1.43917   0.002
##  Alternative hypothesis: rho != 0
```
```r
# QQ-plot of the residuals. Dividing by 10 roughly rescales them by the
# residual standard error (9.103, see the model summary); rstandard(prest.lm1)
# would give properly standardized residuals.
qq_edu <- prest.lm1 %>%
  ggplot(aes(sample = residuals(prest.lm1) / 10)) +
  stat_qq() + stat_qq_line() +
  #theme_ipsum() +
  labs(x = "Theoretical Quantiles\nlm(prestige ~ education)", 
       y = "Standardized residuals")

#qq_edu <- ggplotly(qq_edu)
qq_edu

# Shapiro-Wilk normality test on the residuals.
shapiro.test(residuals(prest.lm1))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(prest.lm1)
## W = 0.98065, p-value = 0.1406
```
```r
prest.lm2 <- lm(prestige ~ income, data = Prestige)
prest.lm2
## 
## Call:
## lm(formula = prestige ~ income, data = Prestige)
## 
## Coefficients:
## (Intercept)       income  
##   27.141176     0.002897
```
```r
# Same QQ-plot and normality test for the residuals of the income model.
qq_inc <- prest.lm2 %>%
  ggplot(aes(sample = residuals(prest.lm2) / 10)) +
  stat_qq() + stat_qq_line() +
  #theme_ipsum() +
  labs(x = "Theoretical Quantiles\nlm(prestige ~ income)", 
       y = "Standardized residuals")

#qq_inc <- ggplotly(qq_inc)
qq_inc

shapiro.test(residuals(prest.lm2))
## 
##  Shapiro-Wilk normality test
## 
## data:  residuals(prest.lm2)
## W = 0.97169, p-value = 0.02729
```
```r
# Scale-location plot and score test for non-constant error variance.
plot(prest.lm1, 3)

ncvTest(prest.lm1)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.6327545, Df = 1, p = 0.42635
```
```r
# Residuals vs fitted values (linearity check), full model summary,
# and 95% confidence intervals for the coefficients.
plot(prest.lm1, 1)

summary(prest.lm1)
## 
## Call:
## lm(formula = prestige ~ education, data = Prestige)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -26.0397  -6.5228   0.6611   6.7430  18.1636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -10.732      3.677  -2.919  0.00434 ** 
## education      5.361      0.332  16.148  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 9.103 on 100 degrees of freedom
## Multiple R-squared:  0.7228, Adjusted R-squared:   0.72 
## F-statistic: 260.8 on 1 and 100 DF,  p-value: < 2.2e-16
confint(prest.lm1)
##                  2.5 %    97.5 %
## (Intercept) -18.027220 -3.436744
## education     4.702223  6.019533
```
```r
# Influence diagnostics and outlier test (Bonferroni-adjusted).
influenceIndexPlot(prest.lm1)

outlierTest(prest.lm1)
## No Studentized residuals with Bonferroni p < 0.05
## Largest |rstudent|:
##          rstudent unadjusted p-value Bonferroni p
## newsboys -2.98896          0.0035306      0.36012
```

```r
# Refit without observations 53 and 67 and compare the coefficients.
prest.lm1bis <- lm(prestige ~ education, data = Prestige[-c(53, 67), ])
compareCoefs(prest.lm1, prest.lm1bis)
## Calls:
## 1: lm(formula = prestige ~ education, data = Prestige)
## 2: lm(formula = prestige ~ education, data = Prestige[-c(53, 67), ])
## 
##             Model 1 Model 2
## (Intercept)  -10.73  -11.26
## SE             3.68    3.53
##                            
## education     5.361   5.417
## SE            0.332   0.318
## 
```
```r
# Point prediction for education = 10.25, with prediction and
# confidence intervals.
my_df <- data.frame(education = c(10.25))
predict(prest.lm1, newdata = my_df)
##        1 
## 44.21701
predict(prest.lm1, newdata = my_df, interval = "prediction")
##        fit      lwr      upr
## 1 44.21701 26.06518 62.36885
predict(prest.lm1, newdata = my_df, interval = "confidence")
##        fit      lwr      upr
## 1 44.21701 42.40008 46.03395
my_df <- data.frame(education = c(10.25, 11.25, 12.25))
predict(prest.lm1, newdata = my_df, interval = "confidence")
##        fit      lwr      upr
## 1 44.21701 42.40008 46.03395
## 2 49.57789 47.75810 51.39768
## 3 54.93877 52.89190 56.98564
```
```r
# Attach residuals and fitted values to a copy of the data.
my_pres <- Prestige
my_pres$res <- residuals(prest.lm1)
head(my_pres)
##                     education income women prestige census type       res
## gov.administrators      13.11  12351 11.16     68.8   1113 prof  9.250875
## general.managers        12.26  25879  4.02     69.1   1130 prof 14.107621
## accountants             12.77   9271 15.70     63.4   1171 prof  5.673573
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof  6.310758
## chemists                14.62   8403 11.68     73.5   2111 prof  5.855950
## physicists              15.64  11030  5.13     77.6   2113 prof  4.487854
my_pres$fitted <- fitted(prest.lm1)
head(my_pres)
##                     education income women prestige census type       res
## gov.administrators      13.11  12351 11.16     68.8   1113 prof  9.250875
## general.managers        12.26  25879  4.02     69.1   1130 prof 14.107621
## accountants             12.77   9271 15.70     63.4   1171 prof  5.673573
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof  6.310758
## chemists                14.62   8403 11.68     73.5   2111 prof  5.855950
## physicists              15.64  11030  5.13     77.6   2113 prof  4.487854
##                       fitted
## gov.administrators  59.54913
## general.managers    54.99238
## accountants         57.72643
## purchasing.officers 50.48924
## chemists            67.64405
## physicists          73.11215
```
```r
head(predict(prest.lm1))
##  gov.administrators    general.managers         accountants purchasing.officers 
##            59.54913            54.99238            57.72643            50.48924 
##            chemists          physicists 
##            67.64405            73.11215
head(fitted(prest.lm1))
##  gov.administrators    general.managers         accountants purchasing.officers 
##            59.54913            54.99238            57.72643            50.48924 
##            chemists          physicists 
##            67.64405            73.11215
```
```r
# Variance-covariance matrix of the coefficient estimates.
vcov(prest.lm1)
##             (Intercept)  education
## (Intercept)   13.520978 -1.1835055
## education     -1.183505  0.1102162
```

```r
# Regression line with its confidence band and the fitted equation.
ggplot(Prestige, aes(y = prestige, x = education)) +
  geom_point() +
  geom_smooth(colour = "red", method = "lm", fill = "red") +
  ylab("Prestige") +
  xlab("education") +
  theme_classic() +
  annotate("text", x = 9, y = 80, label = "prestige = -10.73 + 5.36 * education\n (pval<0.001)")
## `geom_smooth()` using formula 'y ~ x'
```

```r
# Prediction intervals for the observed data (hence the warning), then
# the same plot with the prediction band added as dashed lines.
int_pred <- predict(prest.lm1, interval = "prediction")
## Warning in predict.lm(prest.lm1, interval = "prediction"): predictions on current data refer to _future_ responses
my_prest2 <- cbind(Prestige, int_pred)
head(my_prest2)
##                     education income women prestige census type      fit
## gov.administrators      13.11  12351 11.16     68.8   1113 prof 59.54913
## general.managers        12.26  25879  4.02     69.1   1130 prof 54.99238
## accountants             12.77   9271 15.70     63.4   1171 prof 57.72643
## purchasing.officers     11.42   8865  9.11     56.8   1175 prof 50.48924
## chemists                14.62   8403 11.68     73.5   2111 prof 67.64405
## physicists              15.64  11030  5.13     77.6   2113 prof 73.11215
##                          lwr      upr
## gov.administrators  41.33302 77.76523
## general.managers    36.81573 73.16903
## accountants         39.52816 75.92469
## purchasing.officers 32.33470 68.64379
## chemists            49.31584 85.97226
## physicists          54.67820 91.54609
ggplot(my_prest2, aes(y = prestige, x = education)) +
  geom_point() +
  geom_smooth(colour = "red", method = "lm", fill = "red") +
  geom_line(aes(y = lwr), color = "red", linetype = "dashed") +
  geom_line(aes(y = upr), color = "red", linetype = "dashed") +
  ylab("Prestige") +
  xlab("education") +
  theme_classic() +
  annotate("text", x = 9, y = 80, label = "prestige = -10.73 + 5.36 * education\n (pval<0.001)")
## `geom_smooth()` using formula 'y ~ x'
```